Solution: Divide and Conquer
Let's learn to solve the Spaghetti Query antipattern by using the divide and conquer approach.
A famous quote by William of Ockham also known as the law of parsimony, goes as follows:
The Law of Parsimony
This law essentially states that: “When you have two competing theories that make exactly the same predictions, the simpler one is the better.”
What this means in SQL is that when we have a choice between two queries that produce the same result set, we should choose the simpler one. We should keep this in mind when straightening out instances of this antipattern.
One step at a time#
When we can’t find a logical JOIN
condition between the tables involved in an unintended Cartesian product, it could be because there simply is no such condition. To avoid the Cartesian product, we have to split up a Spaghetti Query into several simpler queries. In the simple example shown earlier, we need only two queries. The query to retrieve count_fixed
is written in the following playground.
The query to retrieve count_open
is written in the following playground. Let’s check the output when we press RUN
.
The results of these two queries are 11
and 7
, as expected.
We may feel slight regret at resorting to an “inelegant” solution by splitting this into multiple queries, but the regret is quickly replaced by relief as we realize that this has several advantages for development, maintenance, and performance. For example:
-
The query doesn’t produce an unwanted Cartesian product, as shown in the earlier examples, so it’s easier to be sure that the query is giving accurate results.
-
When new requirements are added to the report, it’s easier to add another simple query than to integrate more calculations into an already complicated query.
-
The SQL engine can usually optimize and execute a simple query more easily and reliably than a complex query. Even if it seems like the work is duplicated by splitting the query, it may nevertheless be a net win.
-
In a code review or a teammate training session, it’s easier to explain how several straightforward queries work than to explain one intricate query.
Look for the UNION
label#
We can combine the results of several queries into one result set with the UNION
operation. This can be useful if we really want to submit a single query and consume a single result set, for instance, because the result needs to be sorted.
The result of the query is the result of each subquery, concatenated together. This example has two rows, one for each subquery. It is important to remember to include a column to distinguish the results of one subquery from the other, like the status
column, in this case.
We should use the UNION
operation only when the columns in both subqueries are compatible. We can’t change the number, name, or data type of columns midway through a result set, so it is important to ensure that the columns apply to all the rows consistently and sensibly. If we catch ourselves defining a column alias like bugcount_or_customerid_or_null
, we’re probably using UNION
to combine query results that are not compatible.
Solving your boss’s problem#
Let’s return to the problem from the beginning of this chapter. How must you have solved the urgent request for statistics about your project? Your boss said, “I need to know how many products we work on, how many developers fixed bugs, the average bugs fixed per developer, and how many of our fixed bugs were reported by customers.”
The best solution is to split up the work:
- For how many products, use the following:
- For, how many developers fixed bugs, write:
- To find the average number of bugs fixed per the developer, use:
- Finally, to see how many of our fixed bugs were reported by customers, the following may be used:
Some of these queries are tricky enough by themselves. Trying to combine them all into a single pass would be a nightmare.
Writing SQL automatically — with SQL#
When we split up a complex SQL query, the result may be many similar queries, perhaps varying slightly depending on data values. Writing these queries is a chore, so it’s a good idea to use code generation.
Code generation is the technique of writing code whose output is new code we can compile or run. This can be worthwhile if the new code is laborious to write by hand. A code generator can eliminate repetitive work for us.
Multi-table Updates
Executing so many SQL queries or statements may not be the most efficient way to accomplish a task. But we must balance efficiency with getting the task done to achieve the most desirable results.